MapReduce in Dataflow (Python)


2 hours Free

Overview

Duration is 1 min

In this lab, you learn how to use pipeline options and carry out Map and Reduce operations in Dataflow.

What you need

You must have completed Lab 0 and have the following:

  • Logged into the GCP Console with your Qwiklabs-generated account

What you learn

In this lab, you learn how to:

  • Use pipeline options in Dataflow

  • Carry out mapping transformations

  • Carry out reduce aggregations

Introduction

Duration is 1 min

The goal of this lab is to learn how to write MapReduce operations using Dataflow.
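Before working with Dataflow itself, it can help to see the Map and Reduce phases in plain Python. The sketch below is purely illustrative (the function and variable names are not from the lab code): the map phase emits a (key, 1) pair per word, and the reduce phase groups by key and sums.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (key, 1) pair for each word in each line."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce: group the pairs by key and sum the counts."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

lines = ["import apache_beam", "import argparse", "import apache_beam"]
counts = reduce_phase(map_phase(lines))
print(counts)  # {'import': 3, 'apache_beam': 2, 'argparse': 1}
```

In Dataflow, the map phase corresponds to element-wise transforms that can run in parallel, while the reduce phase corresponds to aggregations that must group data by key.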

Setup

For each lab, you get a new Google Cloud project and set of resources for a fixed time at no cost.

  1. Make sure you signed into Qwiklabs using an incognito window.

  2. Note the lab's access time and make sure you can finish in that time block.

  3. When ready, click Start Lab.

  4. Note your lab credentials. You will use them to sign in to the Google Cloud Console.

  5. Click Open Google Console.

  6. Click Use another account and copy/paste the credentials for this lab into the prompts.

  7. Accept the terms and skip the recovery resource page.

Activate Cloud Shell

Cloud Shell is a virtual machine loaded with development tools. It offers a persistent 5 GB home directory and runs on Google Cloud, providing command-line access to your Google Cloud resources.

In the Cloud Console, in the top right toolbar, click the Activate Cloud Shell button.

Click Continue.

It takes a few moments to provision and connect to the environment. When you are connected, you are already authenticated, and the project is set to your PROJECT_ID.

gcloud is the command-line tool for Google Cloud. It comes pre-installed on Cloud Shell and supports tab-completion.

You can list the active account name with this command:

gcloud auth list

(Output)

Credentialed accounts:
 - <myaccount>@<mydomain>.com (active)

(Example output)

Credentialed accounts:
 - google1623327_student@qwiklabs.net

You can list the project ID with this command:

gcloud config list project

(Output)

[core]
project = <project_ID>

(Example output)

[core]
project = qwiklabs-gcp-44776a13dea667a6

Launch Google Cloud Shell Code Editor

Use the Google Cloud Shell Code Editor to easily create and edit directories and files in the Cloud Shell instance.

Once you activate the Google Cloud Shell, click the Open editor button to open the Cloud Shell Code Editor.

You now have three interfaces available:

  • The Cloud Shell Code Editor
  • The Console (switch between the Console and Cloud Shell by clicking its tab)
  • The Cloud Shell command line (click Open Terminal in the Console)

Check project permissions

Before you begin your work on Google Cloud, you need to ensure that your project has the correct permissions within Identity and Access Management (IAM).

  1. In the Google Cloud console, on the Navigation menu, click IAM & Admin > IAM.

  2. Confirm that the default Compute Engine service account {project-number}-compute@developer.gserviceaccount.com is present and has the Editor role assigned. The account prefix is the project number, which you can find on Navigation menu > Home.

If the account is not present in IAM or does not have the Editor role, follow the steps below to assign it.

  • In the Google Cloud console, on the Navigation menu, click Home.

  • Copy the project number (e.g. 729328892908).

  • On the Navigation menu, click IAM & Admin > IAM.

  • At the top of the IAM page, click Add.

  • For New members, type:

{project-number}-compute@developer.gserviceaccount.com

Replace {project-number} with your project number.

  • For Role, select Project (or Basic) > Editor. Click Save.

Identify Map and Reduce operations

Duration is 5 min

Step 1

In Cloud Shell, clone the source repository, which has the starter scripts for this lab:

git clone https://github.com/GoogleCloudPlatform/training-data-analyst

Then navigate to the code for this lab.

cd training-data-analyst/courses/data_analysis/lab2/python

Step 2

Click the Refresh icon.

View the pipeline source code in is_popular.py using the Cloud Shell in-browser editor, or from the command line with nano:

nano is_popular.py

Step 3

What custom arguments are defined? ____________________

What is the default output prefix? _________________________________________

How is the variable output_prefix in main() set? _____________________________

How are the pipeline arguments such as --runner set? ______________________
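If you are unsure how a Beam script separates its own arguments from pipeline arguments such as --runner, a common pattern (sketched below with illustrative argument names and defaults, not the exact lab code) is argparse.parse_known_args: recognized arguments go to the script, and the leftovers are handed to the pipeline.

```python
import argparse

parser = argparse.ArgumentParser(description='Find popular imports')
# Custom arguments with defaults (names and values here are illustrative).
parser.add_argument('--output_prefix', default='/tmp/output',
                    help='Prefix for the output files')

# parse_known_args splits recognized arguments from the rest; the
# remainder (e.g. --runner=DataflowRunner) would be passed to Beam.
known_args, pipeline_args = parser.parse_known_args(
    ['--output_prefix=/tmp/myoutput', '--runner=DirectRunner'])

print(known_args.output_prefix)  # /tmp/myoutput
print(pipeline_args)             # ['--runner=DirectRunner']
```

Compare this pattern against what is_popular.py actually does when answering the questions above.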

Step 4

What are the key steps in the pipeline? _____________________________________________________________________________

Which of these steps happen in parallel? ____________________________________

Which of these steps are aggregations? _____________________________________
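As a rough mental model of the kind of steps such a pipeline chains together, here is a plain-Python sketch (not the Beam code itself; the sample lines are invented): read lines, keep import statements, map each to a package name, sum per key, and take the top entries.

```python
from collections import Counter

lines = [
    'import com.example.appengine;',
    'import com.example.datastore;',
    'import com.example.appengine;',
    'public class Demo {}',
]

# Filter and map steps: these run element by element, so a runner
# such as Dataflow can execute them in parallel.
imports = [line for line in lines if line.startswith('import ')]
packages = [line.split()[1].rstrip(';') for line in imports]

# Aggregation steps: summing per key and taking the top entries
# require grouping data, so they cannot be purely element-wise.
totals = Counter(packages)
top = totals.most_common(5)
print(top)  # [('com.example.appengine', 2), ('com.example.datastore', 1)]
```

In Beam, the element-wise steps correspond to transforms like Map/FlatMap, and the grouping steps to aggregations like CombinePerKey and Top.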

Execute the pipeline

Duration is 2 min

Step 1

Install the necessary dependencies for Python Dataflow:

sudo ./install_packages.sh

Verify that you have the right version of pip (should be > 8.0):

pip3 -V

If not, open a new Cloud Shell tab; it should pick up the updated pip.

Step 2

Run the pipeline locally:

python3 ./is_popular.py
Note: If you see an error that says "No handlers could be found for logger 'oauth2client.contrib.multistore_file'", you may ignore it. It simply means that logging from the oauth2 library goes to stderr.

Step 3

Examine the output file:

cat /tmp/output-*
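The cat command uses a wildcard because Beam may shard its output across several files. The same pattern in Python looks like the sketch below, which first fabricates example shard files in a temporary directory (the file names and contents are illustrative) and then reads them back with a glob.

```python
import glob
import os
import tempfile

# Simulate sharded output files like those Beam's text writer produces
# (the prefix and contents here are illustrative, not the lab's output).
tmpdir = tempfile.mkdtemp()
for shard, text in enumerate(['com.example.appengine,2',
                              'com.example.datastore,1']):
    path = os.path.join(tmpdir, f'output-{shard:05d}-of-00002')
    with open(path, 'w') as f:
        f.write(text + '\n')

# Equivalent of `cat /tmp/output-*`: read every matching shard in order.
lines = []
for path in sorted(glob.glob(os.path.join(tmpdir, 'output-*'))):
    with open(path) as f:
        lines.extend(f.read().splitlines())

print(lines)  # ['com.example.appengine,2', 'com.example.datastore,1']
```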

Use command line parameters

Duration is 2 min

Step 1

Change the output prefix from the default value:

python3 ./is_popular.py --output_prefix=/tmp/myoutput

What will be the name of the new file that is written out?
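As a hint, Beam's text writer typically appends a shard template to the prefix, so a prefix like /tmp/myoutput usually yields names such as /tmp/myoutput-00000-of-00001 (the exact suffix depends on how the lab script calls the writer). A sketch of that default naming scheme:

```python
def shard_name(prefix, shard, num_shards):
    """Mimic Beam's default '-SSSSS-of-NNNNN' shard naming (illustrative)."""
    return f'{prefix}-{shard:05d}-of-{num_shards:05d}'

print(shard_name('/tmp/myoutput', 0, 1))  # /tmp/myoutput-00000-of-00001
```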

Step 2

Note that we now have a new file in the /tmp directory:

ls -lrt /tmp/myoutput*

What you learned

Duration is 1 min

In this lab, you:

  • Used pipeline options in Dataflow
  • Identified Map and Reduce operations in the Dataflow pipeline

End your lab

When you have completed your lab, click End Lab. Qwiklabs removes the resources you’ve used and cleans the account for you.

You will be given an opportunity to rate the lab experience. Select the applicable number of stars, type a comment, and then click Submit.

The number of stars indicates the following:

  • 1 star = Very dissatisfied
  • 2 stars = Dissatisfied
  • 3 stars = Neutral
  • 4 stars = Satisfied
  • 5 stars = Very satisfied

You can close the dialog box if you don't want to provide feedback.

For feedback, suggestions, or corrections, please use the Support tab.

©2020 Google LLC All rights reserved. Google and the Google logo are trademarks of Google LLC. All other company and product names may be trademarks of the respective companies with which they are associated.